Credit Card Users Churn Prediction - Connie Xavier¶

Problem Definition and Information¶

Context¶

Thera Bank recently saw a steep decline in its number of credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge. Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave its credit card services, and understand the reasons, so that it can improve in those areas.

As a data scientist at Thera Bank, I need to build a classification model that will help the bank identify the customers who have a higher probability of renouncing their credit cards, and provide recommendations to help the bank improve its services.

Objective¶

To explore and visualize the data, build an optimized classification model to predict if a customer will renounce credit card services, and generate a set of insights and recommendations that will help the bank.

Key Questions¶

  1. Is there a good classification model to predict whether a customer will renounce their credit card? What does the performance assessment look like for such a model?
  2. What are the key factors influencing whether a customer churns?
  3. What improvements can the bank make to keep credit card users?

Data Information¶

Each record in the database represents a customer's information. A detailed data dictionary can be found below.

Data Dictionary

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  • Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  • Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  • Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

---------------------------------------------------------------------------------------------------------------¶

Part 1: Overview of Data¶

In [664]:
# import relevant libraries
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder


# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black
In [665]:
# load the data
data = pd.read_csv("BankChurners.csv")
In [666]:
# check a sample of the data to make sure it came in correctly
data.sample(n=10, random_state=1)
Out[666]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
6498 712389108 Existing Customer 43 F 2 Graduate Married Less than $40K Blue 36 6 3 2 2570.000 2107 463.000 0.651 4058 83 0.766 0.820
9013 718388733 Existing Customer 38 F 1 College NaN Less than $40K Blue 32 2 3 3 2609.000 1259 1350.000 0.871 8677 96 0.627 0.483
2053 710109633 Existing Customer 39 M 2 College Married $60K - $80K Blue 31 6 3 2 9871.000 1061 8810.000 0.545 1683 34 0.478 0.107
3211 717331758 Existing Customer 44 M 4 Graduate Married $120K + Blue 32 6 3 4 34516.000 2517 31999.000 0.765 4228 83 0.596 0.073
5559 709460883 Attrited Customer 38 F 2 Doctorate Married Less than $40K Blue 28 5 2 4 1614.000 0 1614.000 0.609 2437 46 0.438 0.000
6106 789105183 Existing Customer 54 M 3 Post-Graduate Single $80K - $120K Silver 42 3 1 2 34516.000 2488 32028.000 0.552 4401 87 0.776 0.072
4150 771342183 Attrited Customer 53 F 3 Graduate Single $40K - $60K Blue 40 6 3 2 1625.000 0 1625.000 0.689 2314 43 0.433 0.000
2205 708174708 Existing Customer 38 M 4 Graduate Married $40K - $60K Blue 27 6 2 4 5535.000 1276 4259.000 0.636 1764 38 0.900 0.231
4145 718076733 Existing Customer 43 M 1 Graduate Single $60K - $80K Silver 31 4 3 3 25824.000 1170 24654.000 0.684 3101 73 0.780 0.045
5324 821889858 Attrited Customer 50 F 1 Doctorate Single abc Blue 46 6 4 3 1970.000 1477 493.000 0.662 2493 44 0.571 0.750
  • Looks like all the features came in.
  • CLIENTNUM appears to be unique.
  • Income_Category contains an unexpected value ('abc' in row 5324) alongside the income ranges.
  • There are missing values shown in this sample section.
  • Credit_Limit - Total_Revolving_Bal = Avg_Open_To_Buy. Since these columns are linearly related, we do not need all of them. Avg_Utilization_Ratio is also derived from them.
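This identity can be spot-checked directly. A minimal sketch, using values copied from three of the sample rows above (the column names follow the data dictionary):

```python
import numpy as np
import pandas as pd

# Values copied from three of the sample rows above
sample = pd.DataFrame(
    {
        "Credit_Limit": [2570.0, 9871.0, 1614.0],
        "Total_Revolving_Bal": [2107, 1061, 0],
        "Avg_Open_To_Buy": [463.0, 8810.0, 1614.0],
    }
)

# Credit_Limit - Total_Revolving_Bal should reproduce Avg_Open_To_Buy
derived = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
print(np.allclose(derived, sample["Avg_Open_To_Buy"]))  # True
```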
In [667]:
# check the shape
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data.")
There are 10127 rows and 21 columns in the data.
In [668]:
# check that the ID column is unique
data.CLIENTNUM.nunique()
Out[668]:
10127
  • Since all the values in the CLIENTNUM are unique, we can drop this column.
In [669]:
# checking for duplicate values
df = data.copy()
df = df.drop("CLIENTNUM", axis=1)
print(f"There are {df.duplicated().sum()} duplicated rows in the data.")
There are 0 duplicated rows in the data.
In [670]:
# check datatypes of the columns and which columns have null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
In [671]:
# check which columns have null values
df.isna().sum()[df.isna().sum() > 0]
Out[671]:
Education_Level    1519
Marital_Status      749
dtype: int64
  • There are null values for 2 columns. We will have to treat these missing values.
  • The dependent variable, Attrition_Flag, is of object type.
  • Gender, Education_Level, Marital_Status, Income_Category, and Card_Category are also object datatypes.
  • All other variables are of float or integer type.
  • We will want to convert categorical variables to category datatype.
In [672]:
# check the unique values for the categorical variables
cat_cols = list(df.select_dtypes(include="object").columns)

for i in cat_cols:
    print(df[i].value_counts(normalize=True))
    print("-" * 50)
Existing Customer   0.839
Attrited Customer   0.161
Name: Attrition_Flag, dtype: float64
--------------------------------------------------
F   0.529
M   0.471
Name: Gender, dtype: float64
--------------------------------------------------
Graduate        0.363
High School     0.234
Uneducated      0.173
College         0.118
Post-Graduate   0.060
Doctorate       0.052
Name: Education_Level, dtype: float64
--------------------------------------------------
Married    0.500
Single     0.420
Divorced   0.080
Name: Marital_Status, dtype: float64
--------------------------------------------------
Less than $40K   0.352
$40K - $60K      0.177
$80K - $120K     0.152
$60K - $80K      0.138
abc              0.110
$120K +          0.072
Name: Income_Category, dtype: float64
--------------------------------------------------
Blue       0.932
Silver     0.055
Gold       0.011
Platinum   0.002
Name: Card_Category, dtype: float64
--------------------------------------------------
  • About 16% of customers have closed their credit card accounts (attrited).
  • One of the Income_Category types is 'abc' which is not an income. We can replace this with an unknown category.
  • All other columns take on an expected range of values.
In [673]:
# replace values
df.Income_Category.replace("abc", "Unknown", inplace=True)
df.Income_Category.value_counts()
Out[673]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64
In [674]:
# convert object types to category type
df[cat_cols] = df[cat_cols].astype("category")
In [675]:
# look at the statistical summary of the data
df.describe(include="all").T
Out[675]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 10127.000 NaN NaN NaN 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 10127.000 NaN NaN NaN 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Education_Level 8608 6 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 9378 3 Married 4687 NaN NaN NaN NaN NaN NaN NaN
Income_Category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
Card_Category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 10127.000 NaN NaN NaN 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 NaN NaN NaN 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 NaN NaN NaN 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 NaN NaN NaN 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 NaN NaN NaN 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 NaN NaN NaN 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 NaN NaN NaN 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 NaN NaN NaN 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 NaN NaN NaN 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 NaN NaN NaN 0.275 0.276 0.000 0.023 0.176 0.503 0.999

Significant observations:

  • Most customers are female, graduate-educated, married, make less than $40K, and hold the Blue card.
  • Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon have mean and median values that are close in value.
  • Credit_Limit, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1 appear skewed to the right since there is a large gap between the 75% and max values.
  • Some accounts have a 0 Total_Revolving_Bal.
  • Dependent_count ranges from 0 to 5, with the mean of 2 dependents.
  • Customers range from 26 to 73 years old.
  • The max of Months_Inactive_12_mon is 6, which is less than 12, so it is reasonable.
  • Avg_Utilization_Ratio ranges from 0 to 1 which is reasonable.
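The right-skew noted above can be quantified rather than just eyeballed from quartiles, using pandas' .skew(). A minimal sketch on synthetic data; the real check would simply call df[num_cols].skew() on the actual frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "symmetric": rng.normal(size=1000),          # skewness near 0
        "right_skewed": rng.exponential(size=1000),  # skewness well above 0
    }
)

# Positive skewness indicates a long right tail
print(toy.skew())
```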

Part 2: Exploratory Data Analysis and Data Preprocessing¶

Exploratory data analysis and data preprocessing (missing-value and outlier detection and treatment) often depend on each other, so they are presented together below.

Univariate Analysis¶

In [676]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [677]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [678]:
## plot histogram and boxplot for the numerical features
num_cols = list(df.select_dtypes(include=["float", "int"]).columns)
for i in num_cols:
    print(i)
    histogram_boxplot(df, i)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
Customer_Age
 ****************************************************************** 
Dependent_count
 ****************************************************************** 
Months_on_book
 ****************************************************************** 
Total_Relationship_Count
 ****************************************************************** 
Months_Inactive_12_mon
 ****************************************************************** 
Contacts_Count_12_mon
 ****************************************************************** 
Credit_Limit
 ****************************************************************** 
Total_Revolving_Bal
 ****************************************************************** 
Avg_Open_To_Buy
 ****************************************************************** 
Total_Amt_Chng_Q4_Q1
 ****************************************************************** 
Total_Trans_Amt
 ****************************************************************** 
Total_Trans_Ct
 ****************************************************************** 
Total_Ct_Chng_Q4_Q1
 ****************************************************************** 
Avg_Utilization_Ratio
 ****************************************************************** 
  • Customer_Age appears normally distributed with a couple of outliers to the right, but these are within a reasonable age range so we will not treat these.
  • Dependent_count is slightly skewed to the right. There are no outliers.
  • Months_on_book appears normally distributed, with a high peak in the center. This may mean several customers joined at the same time. There are outliers to the right and left, but these are not unreasonable points.
  • Total_Relationship_Count is slightly skewed to the left, with the mean less than the median.
  • Months_Inactive_12_mon has outliers to the right and left, but these are reasonable values.
  • Contacts_Count_12_mon is slightly skewed to the right.
  • Credit_Limit is severely skewed to the right, with several outliers to the right. There are a lot of customers with a high credit limit (~35,000).
  • Total_Revolving_Bal has no outliers, but peaks on both ends of the distribution. The peak at 0 means no balance is carried over to the next month.
  • Avg_Open_To_Buy has a similar distribution to Credit_Limit with many outliers to the right.
  • Total_Amt_Chng_Q4_Q1 has outliers to the right and left, but there are a few extreme outliers to the right. We will want to look into these further.
  • Total_Trans_Amt is right skewed, and has several peaks in the data. We will want to explore whether to treat the outliers since the distribution appears to have several smaller normal distributions within it.
  • Total_Trans_Ct is bimodal with a few outliers to the right.
  • Total_Ct_Chng_Q4_Q1 has several outliers to the right.
  • Avg_Utilization_Ratio is right-skewed with no outliers.
In [679]:
## Barplot for the categorical features
for i in cat_cols:
    print(i)
    labeled_barplot(df, i, perc=True)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
Attrition_Flag
 ****************************************************************** 
Gender
 ****************************************************************** 
Education_Level
 ****************************************************************** 
Marital_Status
 ****************************************************************** 
Income_Category
 ****************************************************************** 
Card_Category
 ****************************************************************** 
  • 16% are attrited customers, meaning this is a heavily imbalanced dataset.
  • There are more female than male customers by about 6%.
  • A plurality (31%) of customers have a graduate degree.
  • A plurality (46%) of customers are married.
  • A plurality (35%) of customers make less than $40K.
  • Most (93%) customers hold the Blue credit card. Very few hold the Platinum credit card.
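Given the ~16% attrition rate, any train/test split should be stratified on the target so both sets keep the same class balance. A minimal sketch with toy labels; the real split would pass the actual X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mimicking a 16% positive (attrited) class
y = np.array([1] * 16 + [0] * 84)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 16% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
print(y_tr.mean(), y_te.mean())  # 0.16 0.16
```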

Outlier Detection and Treatment¶

We will check the variables that had extreme outliers based on the univariate distributions.

In [680]:
# check higher end outliers of Credit Limit
df[df.Credit_Limit > 30000].Income_Category.value_counts()
Out[680]:
$80K - $120K      285
$120K +           239
$60K - $80K        77
Unknown            66
$40K - $60K         0
Less than $40K      0
Name: Income_Category, dtype: int64
  • No customers with the highest credit limits fall in the bottom two income categories, so these points are plausible and we will leave them.
In [681]:
# Treat Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 outliers that are greater than 4*IQR from the median
# look at Total_Amt_Chng_Q4_Q1 first
quartiles = np.quantile(
    df["Total_Amt_Chng_Q4_Q1"][df["Total_Amt_Chng_Q4_Q1"].notnull()], [0.25, 0.75]
)
tot_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_tot = df.loc[
    np.abs(df["Total_Amt_Chng_Q4_Q1"] - df["Total_Amt_Chng_Q4_Q1"].median()) > tot_4iqr,
    "Total_Amt_Chng_Q4_Q1",
]
print(outlier_tot.sort_values(ascending=False).count() / df.shape[0] * 100, "%")
df.loc[outlier_tot.sort_values().index]
0.543102597017873 %
Out[681]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
95 Existing Customer 64 M 1 Graduate Married Less than $40K Blue 52 6 4 3 1709.000 895 814.000 1.656 1673 32 0.882 0.524
1883 Existing Customer 37 M 2 College Married $80K - $120K Blue 17 5 3 2 4631.000 1991 2640.000 1.669 2864 37 0.947 0.430
113 Existing Customer 54 F 0 Uneducated Married Less than $40K Blue 36 2 2 2 1494.000 706 788.000 1.674 1305 24 3.000 0.473
3270 Existing Customer 49 M 3 High School NaN $60K - $80K Blue 36 3 2 2 9551.000 1833 7718.000 1.675 3213 52 1.476 0.192
1570 Existing Customer 49 M 2 NaN Single $60K - $80K Blue 38 4 1 2 2461.000 1586 875.000 1.676 1729 35 0.750 0.644
137 Existing Customer 45 M 4 College Divorced $60K - $80K Blue 40 5 1 0 10408.000 1186 9222.000 1.689 2560 42 1.211 0.114
89 Existing Customer 57 M 2 NaN Married $120K + Blue 45 5 3 3 5266.000 0 5266.000 1.702 1516 29 1.636 0.000
94 Existing Customer 45 F 3 NaN Married Unknown Blue 28 5 1 2 2535.000 2440 95.000 1.705 1312 20 1.222 0.963
1689 Existing Customer 34 M 0 Graduate Married $60K - $80K Blue 26 4 3 3 5175.000 977 4198.000 1.705 2405 49 0.885 0.189
15 Existing Customer 44 M 4 NaN NaN $80K - $120K Blue 37 5 1 2 4234.000 972 3262.000 1.707 1348 27 1.700 0.230
336 Existing Customer 56 F 1 Graduate Married Less than $40K Blue 38 4 3 3 2578.000 2462 116.000 1.707 1378 29 0.812 0.955
16 Existing Customer 48 M 4 Post-Graduate Single $80K - $120K Blue 36 6 2 3 30367.000 2362 28005.000 1.708 1671 27 0.929 0.078
68 Existing Customer 49 M 2 Graduate Married $60K - $80K Blue 32 2 2 2 1687.000 1107 580.000 1.715 1670 17 2.400 0.656
36 Existing Customer 55 F 3 Graduate Married Less than $40K Blue 36 6 2 3 3035.000 2298 737.000 1.724 1877 37 1.176 0.757
32 Existing Customer 41 M 4 Graduate Married $60K - $80K Blue 36 4 1 2 8923.000 2517 6406.000 1.726 1589 24 1.667 0.282
231 Existing Customer 57 M 2 NaN Married $80K - $120K Blue 46 2 3 0 18871.000 1740 17131.000 1.727 1516 21 2.000 0.092
2565 Existing Customer 39 M 3 Graduate Married $120K + Blue 36 3 3 2 32964.000 2231 30733.000 1.731 3094 45 1.647 0.068
2337 Existing Customer 50 F 2 Graduate Divorced $40K - $60K Blue 40 6 2 5 8307.000 2517 5790.000 1.743 2293 36 0.800 0.303
1369 Existing Customer 36 F 2 Uneducated Married Less than $40K Blue 36 4 2 2 4066.000 1639 2427.000 1.749 3040 56 0.931 0.403
33 Existing Customer 53 F 2 College Married Less than $40K Blue 38 5 2 3 2650.000 1490 1160.000 1.750 1411 28 1.000 0.562
190 Existing Customer 57 M 1 Graduate Married $80K - $120K Blue 47 5 3 1 14612.000 1976 12636.000 1.768 1827 24 3.000 0.135
1718 Existing Customer 42 F 4 Post-Graduate Single Less than $40K Blue 36 6 2 3 1438.300 674 764.300 1.769 2451 55 1.292 0.469
1455 Existing Customer 39 F 2 Doctorate Married Unknown Blue 36 5 2 4 8058.000 791 7267.000 1.787 2742 42 2.000 0.098
180 Existing Customer 45 M 2 Uneducated Married $40K - $60K Blue 34 3 2 1 5771.000 2248 3523.000 1.791 1387 18 0.800 0.390
1486 Existing Customer 39 M 2 Graduate Married $40K - $60K Blue 31 5 3 2 8687.000 1146 7541.000 1.800 2279 33 1.357 0.132
115 Existing Customer 49 M 1 Graduate Single $80K - $120K Blue 36 6 2 2 18886.000 895 17991.000 1.826 1235 18 1.571 0.047
18 Existing Customer 61 M 1 High School Married $40K - $60K Blue 56 2 2 3 3193.000 2517 676.000 1.831 1336 30 1.143 0.788
295 Existing Customer 60 M 0 High School Married $40K - $60K Blue 36 5 1 3 3281.000 837 2444.000 1.859 1424 29 1.417 0.255
855 Existing Customer 39 F 2 Graduate Married Unknown Blue 31 4 2 3 1438.300 997 441.300 1.867 2583 47 0.958 0.693
117 Existing Customer 50 M 3 High School Single $80K - $120K Blue 39 4 1 4 9964.000 1559 8405.000 1.873 1626 25 0.786 0.156
1176 Existing Customer 34 M 2 College Married $80K - $120K Blue 22 4 2 4 1631.000 0 1631.000 1.893 2962 57 1.111 0.000
869 Existing Customer 39 M 2 College Married $60K - $80K Blue 35 4 3 2 7410.000 2517 4893.000 1.924 2398 37 1.176 0.340
88 Existing Customer 44 M 3 High School Single $60K - $80K Blue 31 4 3 1 12756.000 837 11919.000 1.932 1413 14 1.800 0.066
6 Existing Customer 51 M 4 NaN Married $120K + Gold 46 6 1 3 34516.000 2264 32252.000 1.975 1330 31 0.722 0.066
142 Existing Customer 54 M 4 Graduate Married $80K - $120K Blue 34 2 3 2 14926.000 2517 12409.000 1.996 1576 25 1.500 0.169
431 Existing Customer 47 F 4 NaN Divorced $40K - $60K Blue 34 6 1 2 3502.000 1851 1651.000 2.023 1814 31 0.722 0.529
1873 Existing Customer 38 M 3 Uneducated Married $60K - $80K Blue 36 5 2 3 3421.000 2308 1113.000 2.037 2269 39 1.053 0.675
1085 Existing Customer 45 F 3 Graduate Single Unknown Blue 36 3 3 4 11189.000 2517 8672.000 2.041 2959 58 1.231 0.225
177 Existing Customer 67 F 1 Graduate Married Less than $40K Blue 56 4 3 2 3006.000 2517 489.000 2.053 1661 32 1.000 0.837
1219 Existing Customer 38 F 4 Graduate Married Unknown Blue 28 4 1 2 6861.000 1598 5263.000 2.103 2228 39 0.950 0.233
154 Existing Customer 53 F 1 College Married Less than $40K Blue 47 4 2 3 2154.000 930 1224.000 2.121 1439 26 1.364 0.432
284 Existing Customer 61 M 0 Graduate Married $40K - $60K Blue 52 3 1 2 2939.000 1999 940.000 2.145 2434 33 1.538 0.680
4 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
841 Existing Customer 37 F 3 NaN Married Less than $40K Blue 25 6 2 1 1438.300 674 764.300 2.180 1717 31 0.722 0.469
7 Existing Customer 32 M 0 High School NaN $60K - $80K Silver 27 2 2 2 29081.000 1396 27685.000 2.204 1538 36 0.714 0.048
466 Existing Customer 63 M 2 Graduate Married $60K - $80K Blue 49 5 2 3 14035.000 2061 11974.000 2.271 1606 30 1.500 0.147
58 Existing Customer 44 F 5 Graduate Married Unknown Blue 35 4 1 2 6273.000 978 5295.000 2.275 1359 25 1.083 0.156
658 Existing Customer 46 M 4 Graduate Married $60K - $80K Blue 35 5 1 2 1535.000 700 835.000 2.282 1848 25 1.083 0.456
46 Existing Customer 56 M 2 Doctorate Married $60K - $80K Blue 45 6 2 0 2283.000 1430 853.000 2.316 1741 27 0.588 0.626
47 Existing Customer 59 M 1 Doctorate Married $40K - $60K Blue 52 3 2 2 2548.000 2020 528.000 2.357 1719 27 1.700 0.793
219 Existing Customer 44 F 3 Uneducated Divorced Less than $40K Silver 38 4 1 3 11127.000 1835 9292.000 2.368 1546 25 1.273 0.165
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
773 Existing Customer 61 M 0 Post-Graduate Married Unknown Blue 53 6 2 3 14434.000 1927 12507.000 2.675 1731 32 3.571 0.134
8 Existing Customer 37 M 3 Uneducated Single $60K - $80K Blue 36 5 2 0 22352.000 2517 19835.000 3.355 1350 24 1.182 0.113
12 Existing Customer 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.000 0 11751.000 3.397 1539 17 3.250 0.000
  • It appeared from the graph that values above 2.5 are outliers, but since the data is skewed, several more points fall more than 4*IQR from the median. The same applies to Total_Ct_Chng_Q4_Q1.
  • Instead of dropping the points outside 4*IQR, we will transform the severely skewed features; this may leave fewer points as outliers.
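A log transform is one option for reducing the right skew before revisiting outliers. A minimal sketch on synthetic lognormal data; the real version would apply np.log1p to skewed columns such as Credit_Limit or Total_Trans_Amt:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# log1p compresses the long right tail, shrinking the skewness
transformed = np.log1p(skewed)
print(round(skewed.skew(), 2), round(transformed.skew(), 2))
```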

Bivariate Analysis¶

Explore relationships between variables.

In [682]:
# correlation plot
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated, since Avg_Open_To_Buy is derived from Credit_Limit. We will drop one of these.
  • Months_on_Book and Customer_Age have a high positive correlation.
  • Total_Trans_Amt and Total_Trans_Ct have a high positive correlation.
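Since Avg_Open_To_Buy is just Credit_Limit minus Total_Revolving_Bal, one of the pair can be dropped without losing information. A minimal sketch on a toy frame showing the near-perfect correlation and the drop:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Credit_Limit": [2570.0, 9871.0, 1614.0, 34516.0],
        "Total_Revolving_Bal": [2107, 1061, 0, 2517],
    }
)
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]

# The derived column is almost perfectly correlated with Credit_Limit
print(toy["Credit_Limit"].corr(toy["Avg_Open_To_Buy"]) > 0.99)  # True

# Drop the redundant column
reduced = toy.drop("Avg_Open_To_Buy", axis=1)
```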
In [683]:
# plot numerical features against each other
sns.pairplot(data=df, hue="Attrition_Flag", vars=num_cols, corner=True)
plt.show()
  • We can see the clear linear relationship between Credit_Limit and Avg_Open_To_Buy, and between Months_on_book and Customer_Age.
  • Some relationships appear non-linear, such as between Credit_Limit (or Avg_Open_To_Buy) and Avg_Utilization_Ratio.
  • There are clear separations between the distributions of existing and attrited customers with respect to transaction amounts and counts.
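The separation in transaction behavior can be summarized numerically with a groupby. A minimal sketch on made-up toy rows; the real call would be df.groupby("Attrition_Flag")[num_cols].mean():

```python
import pandas as pd

# Toy rows mimicking the two customer groups (values are illustrative)
toy = pd.DataFrame(
    {
        "Attrition_Flag": ["Existing", "Existing", "Attrited", "Attrited"],
        "Total_Trans_Ct": [80, 70, 40, 45],
    }
)

# Mean transaction count per group: attrited customers transact less
means = toy.groupby("Attrition_Flag")["Total_Trans_Ct"].mean()
print(means)
```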
In [684]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [685]:
# plot numerical variables with respect to target
sns.set(font_scale=1)
for i in num_cols:
    distribution_plot_wrt_target(df, i, "Attrition_Flag")
    plt.show()
    print("*" * 100)
  • Customers who attrited and those who did not show little difference in Age, Months_on_book, or number of dependents. The median age of attrited customers is only slightly higher.
  • Customers who attrited and those who did not, have dependents ranging from 0-5 and are concentrated around having 2 and 3 dependents.
  • The median number of products held by attrited customers is less than other customers. This could be an indication of diminishing interest in the bank.
  • Most customers are inactive for 2-3 months in the last 12 months, regardless of whether they attrited.
  • Only attrited customers were contacted 6 times in the last 12 months.
  • Existing customers have a slightly higher median credit limit than attrited customers.
  • Existing customers have a higher median total revolving balance than attrited customers.
  • Median Avg_Open_To_Buy is similar between attrited and existing customers.
  • Existing customers span a higher range of Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 than attrited customers when considering outliers. Having higher transaction amounts and counts in the fourth quarter compared to first quarter means existing customers have an interest in continuing to use the credit card vs. the attrited customers.
  • For Total_Amt_Chng_Q4_Q1, the distribution is approximately normal for attrited customers and right-skewed for existing customers.
  • Existing customers have much higher median total transaction amounts and transaction counts than attrited customers.
  • The bimodal distribution in Total_Trans_Ct appears only in the data of existing customers, not attrited customers.
  • Existing customers have a higher median average utilization ratio than attrited customers.
In [686]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
In [687]:
# plot categorical variables with respect to target
othercols = cat_cols.copy()
othercols.remove("Attrition_Flag")
for i in othercols:
    print(i)
    stacked_barplot(df, i, "Attrition_Flag")
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
Gender
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Education_Level
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level                                            
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Marital_Status
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status                                            
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Income_Category
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
Unknown                        187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Card_Category
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category                                              
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
  • More female than male customers attrited.
  • Customers with more advanced degrees like Doctorate or Post-Graduate had a higher proportion of attrition compared to less educated customers.
  • Proportion of attrited customers was similar across marital status.
  • Proportion of attrited customers was fairly similar across income categories.
  • Platinum card users are more likely to attrite than other card holder types.
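The per-category proportions behind these bullets reduce to a one-line groupby; a sketch on a small made-up frame (the column names mirror the notebook's, the counts do not):

```python
import pandas as pd

# Made-up counts: 2 of 8 Blue customers attrited, 3 of 4 Platinum did
toy = pd.DataFrame({
    "Card_Category": ["Blue"] * 8 + ["Platinum"] * 4,
    "Attrition_Flag": (["Attrited Customer"] * 2 + ["Existing Customer"] * 6
                       + ["Attrited Customer"] * 3 + ["Existing Customer"]),
})

# Attrition rate per category: mean of a boolean flag grouped by category
rates = (toy["Attrition_Flag"].eq("Attrited Customer")
         .groupby(toy["Card_Category"]).mean())
print(rates)  # Blue 0.25, Platinum 0.75
```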
In [730]:
# see how Marital_Status varies with other less significant variables since these did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
var_cols = [
    "Customer_Age",
    "Dependent_count",
    "Months_on_book",
    "Months_Inactive_12_mon",
    "Avg_Open_To_Buy",
]
for i, variable in enumerate(var_cols):
    plt.subplot(7, 2, i + 1)
    sns.boxplot(
        data=df, x="Marital_Status", y=variable, hue="Attrition_Flag", palette="PuBu"
    )
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
  • There is no significant difference in these numerical variables across Marital_Status for attrited and existing customers. The median Avg_Open_To_Buy is slightly higher for divorced attrited customers than for divorced existing customers.
In [731]:
# see how Income_Category varies with other less significant variables since these did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
var_cols = [
    "Customer_Age",
    "Dependent_count",
    "Months_on_book",
    "Months_Inactive_12_mon",
    "Avg_Open_To_Buy",
]
for i, variable in enumerate(var_cols):
    plt.subplot(7, 2, i + 1)
    sns.boxplot(
        data=df, x="Income_Category", y=variable, hue="Attrition_Flag", palette="PuBu"
    )
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
  • Median values of these numerical variables are similar for attrited and existing customers of each Income_Category, with the biggest differences seen in Months_Inactive_12_mon and Dependent_count.

Feature Transformations¶

We will apply transformations to the highly skewed features so they are more normally distributed. This may also reduce the number of outliers.
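On a toy lognormal series, both transformations pull in the long right tail (a sketch, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Lognormal data is strongly right-skewed; log recovers a roughly normal shape
rng = np.random.default_rng(2)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

raw_skew = s.skew()
sqrt_skew = np.sqrt(s).skew()
log_skew = np.log(s).skew()
print(f"raw: {raw_skew:.2f}  sqrt: {sqrt_skew:.2f}  log: {log_skew:.2f}")
```

The sqrt transform reduces skew moderately and tolerates zeros, while log reduces it further but requires strictly positive values — which is why the cells below check the column minimums first.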

In [688]:
# check min values of columns to see if any are 0 which is important to consider in log transformations
df[skew_cols].min()
Out[688]:
Credit_Limit            1438.300
Avg_Open_To_Buy            3.000
Total_Amt_Chng_Q4_Q1       0.000
Total_Trans_Amt          510.000
Total_Ct_Chng_Q4_Q1        0.000
Avg_Utilization_Ratio      0.000
dtype: float64
In [689]:
# use sqrt transformation to get highly right-skewed variables more normally distributed
# create copy of data
df1 = df.copy()
# identify skewed variables that work better with sqrt transformations
sqrt_cols = [
    "Avg_Open_To_Buy",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
]
# create transformed features and plot them
for col in sqrt_cols:
    df1[col + "_sqrt"] = np.sqrt(df1[col])
    histogram_boxplot(df1, col + "_sqrt")
# dropping the original columns
df1.drop(sqrt_cols, axis=1, inplace=True)

# use log transformation to get highly right-skewed variables more normally distributed
# identify skewed variables that work better with log transformations
log_cols = [
    "Credit_Limit",
    "Total_Trans_Amt",
]
# create transformed features and plot them
for col in log_cols:
    df1[col + "_log"] = np.log(df1[col])
    histogram_boxplot(df1, col + "_log")
# dropping the original columns
df1.drop(log_cols, axis=1, inplace=True)
  • The skewness of these features has decreased, and the median and mean values are closer together. The number of outliers has decreased for some of the features.
In [690]:
# calculate correlations of transformed features to see if we need to drop any highly correlated features
plt.figure(figsize=(15, 7))
sns.heatmap(df1.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Avg_Open_To_Buy_sqrt and Credit_Limit_log are highly correlated. We will drop one of these.
  • Total_Trans_Amt_log and Total_Trans_Ct are highly correlated (greater than 0.85). We will drop one of these.

Drop highly correlated columns¶

In [691]:
# create copy of data
df2 = df1.copy()
# drop highly correlated columns
df2.drop(columns=["Avg_Open_To_Buy_sqrt", "Total_Trans_Amt_log"], inplace=True)

Splitting data¶

In [692]:
# separating target variable from other variables
X = df2.drop(columns="Attrition_Flag")
y = df2["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
In [693]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 17) (2026, 17) (2026, 17)
In [694]:
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026
In [695]:
print("Split of 0 and 1 in training data:\n", y_train.value_counts(normalize=True))
print("Split of 0 and 1 in validation data:\n", y_val.value_counts(normalize=True))
print("Split of 0 and 1 in test data:\n", y_test.value_counts(normalize=True))
Split of 0 and 1 in training data:
 0   0.839
1   0.161
Name: Attrition_Flag, dtype: float64
Split of 0 and 1 in validation data:
 0   0.839
1   0.161
Name: Attrition_Flag, dtype: float64
Split of 0 and 1 in test data:
 0   0.840
1   0.160
Name: Attrition_Flag, dtype: float64

Preprocessing Data: Missing value treatment, Standard Scaling, Creating dummies¶

In [696]:
# recall which features are missing values
df2.isna().sum()[df2.isna().sum() > 0]
Out[696]:
Education_Level    1519
Marital_Status      749
dtype: int64
  • We will impute missing values in the categorical variables with the mode; if future data has missing values in the numerical variables, we will impute them with the median.
In [697]:
# creating a list of categorical variables
categorical_features = X_train.select_dtypes(include=["category"]).columns.tolist()

# creating a transformer for categorical variables, which will first apply simple imputer and then do one hot encoding for categorical variables
# with one hot encoding, I will drop the first variable as logistic regression is affected by multicollinearity
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(drop="first")),
    ]
)
# creating a list of numerical variables
numerical_features = X_train.select_dtypes(
    include=["int64", "float64"]
).columns.tolist()
# creating a transformer for numerical variables, which will apply standard scaling on the numerical variables
numeric_transformer = Pipeline(
    steps=[
        ("imputer_num", SimpleImputer(strategy="median")),
        ("standard scaler", StandardScaler()),
    ]
)
In [698]:
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)
In [699]:
# fit the pipeline to the training data
preprocessor.fit(X_train)

# apply the pipeline to the training and test data
X_train_t = preprocessor.transform(X_train)
X_val_t = preprocessor.transform(X_val)
X_test_t = preprocessor.transform(X_test)
In [700]:
print(X_train_t.shape, X_val_t.shape, X_test_t.shape)
(6075, 28) (2026, 28) (2026, 28)

Part 3: Model Evaluation¶

Model evaluation criterion:¶

Model can make wrong predictions as:¶

  1. Predicting a customer will renounce credit card services but in reality, the customer is not going to do that - Loss of resources
  2. Predicting a customer will not renounce credit card services but the customer will - Loss of revenue

Which case is more important?¶

  • Predicting a customer will not renounce credit card services when the customer actually does would be more detrimental to the business, since it results in lost revenue.

How to reduce this loss?¶

  • Recall should be maximized; the higher the recall, the fewer the false negatives (churners the model misses).
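A worked example of the recall/false-negative link (recall = TP / (TP + FN)):

```python
from sklearn.metrics import confusion_matrix, recall_score

# 4 actual churners (1) and 4 non-churners (0); the model misses one churner
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)
print(f"recall = {recall}")        # 3 / (3 + 1) = 0.75
print(f"false negatives = {fn}")   # the one churner the model missed
```

Since the number of actual churners (TP + FN) is fixed, pushing recall up directly pushes false negatives down.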
In [701]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [702]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Part 4: Model Building¶

Basic Models¶

In [703]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("logr", LogisticRegression(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))

names = []  # Empty list to store name of the models
score_train = []
score_val = []

# loop through all models to get the training performance
for name, model in models:
    model.fit(X_train_t, y_train)
    scores = model_performance_classification_sklearn(model, X_train_t, y_train)
    score_train.append(scores.T)
    names.append(name)
# create dataframe of results
models_train_basic = pd.concat(score_train, axis=1)
models_train_basic.columns = names

# loop through all models to get the validation performance (models are already fitted above)
for name, model in models:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val.append(scores.T)
# create dataframe of results
models_val_basic = pd.concat(score_val, axis=1)
models_val_basic.columns = names

print("Training Performance Comparison:")
models_train_basic    
Training Performance Comparison:
Out[703]:
logr dtree Bagging GBM Adaboost Xgboost
Accuracy 0.897 1.000 0.993 0.943 0.928 1.000
Recall 0.519 1.000 0.960 0.742 0.719 0.999
Precision 0.764 1.000 0.996 0.887 0.813 1.000
F1 0.618 1.000 0.978 0.808 0.763 0.999
In [704]:
print("\n" "Validation Performance Comparison:" "\n")
models_val_basic
Validation Performance Comparison:

Out[704]:
logr dtree Bagging GBM Adaboost Xgboost
Accuracy 0.907 0.888 0.923 0.939 0.929 0.938
Recall 0.580 0.644 0.656 0.733 0.730 0.758
Precision 0.784 0.656 0.829 0.869 0.810 0.843
F1 0.667 0.650 0.733 0.795 0.768 0.798
  • When looking at recall, the models that are not overfitting to the training set are logistic regression, gradient boosting, and AdaBoost.
  • Decision tree, Bagging, and XGBoost are overfitting to the training set.
  • Logistic regression has the worst recall performance of all the models.
  • Gradient boosting and AdaBoost had good general performance without overfitting.
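The overfitting judgments above come from comparing training and validation recall; a sketch of that check on synthetic data (not the notebook's split):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data with 5% label noise, so memorization cannot generalize
X, y = make_classification(n_samples=2000, n_informative=5, weights=[0.84],
                           flip_y=0.05, random_state=1)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.25,
                                        random_state=1, stratify=y)

# An unconstrained tree memorizes the training set, so the gap is large
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
gap = (recall_score(y_tr, tree.predict(X_tr))
       - recall_score(y_v, tree.predict(X_v)))
print(f"train-validation recall gap: {gap:.2f}")
```

A small gap (as for logistic regression, GBM, and AdaBoost above) signals generalization; a gap from near-perfect training recall signals overfitting.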

Models with Oversampled Data¶

In [705]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train_t, y_train)


print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of X_train: {}".format(X_train_over.shape))
print("After Oversampling, the shape of y_train: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099 

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099 

After Oversampling, the shape of X_train: (10198, 28)
After Oversampling, the shape of y_train: (10198,) 

In [706]:
models_over = []  # Empty list to store all the models

# Appending models into the list
models_over.append(("logr_over", LogisticRegression(random_state=1)))
models_over.append(("dtree_over", DecisionTreeClassifier(random_state=1)))
models_over.append(("Bagging_over", BaggingClassifier(random_state=1)))
models_over.append(("GBM_over", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost_over", AdaBoostClassifier(random_state=1)))
models_over.append(
    ("Xgboost_over", XGBClassifier(random_state=1, eval_metric="logloss"))
)

names_over = []  # Empty list to store name of the models
score_train_over = []
score_val_over = []

# loop through all models to get the training performance
for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    scores = model_performance_classification_sklearn(model, X_train_over, y_train_over)
    score_train_over.append(scores.T)
    names_over.append(name)
# create dataframe of results
models_train_over = pd.concat(score_train_over, axis=1)
models_train_over.columns = names_over

# loop through all models to get the validation performance (models are already fitted above)
for name, model in models_over:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val_over.append(scores.T)
# create dataframe of results
models_val_over = pd.concat(score_val_over, axis=1)
models_val_over.columns = names_over

print("Training Performance Comparison:")
models_train_over
Training Performance Comparison:
Out[706]:
logr_over dtree_over Bagging_over GBM_over Adaboost_over Xgboost_over
Accuracy 0.842 1.000 0.997 0.961 0.933 1.000
Recall 0.845 1.000 0.997 0.960 0.940 1.000
Precision 0.840 1.000 0.998 0.961 0.926 1.000
F1 0.842 1.000 0.997 0.961 0.933 1.000
In [707]:
print("\n" "Validation Performance Comparison:" "\n")
models_val_over
Validation Performance Comparison:

Out[707]:
logr_over dtree_over Bagging_over GBM_over Adaboost_over Xgboost_over
Accuracy 0.834 0.892 0.919 0.928 0.903 0.940
Recall 0.831 0.730 0.730 0.794 0.788 0.782
Precision 0.490 0.647 0.758 0.766 0.669 0.833
F1 0.617 0.686 0.744 0.780 0.724 0.807
  • Logistic Regression performs much better on recall with oversampled data; recall improved by more than 20 percentage points.
  • The Gradient Boosting and AdaBoost models now show more overfitting in recall than before.
  • All models except Logistic Regression are overfitting to the training set in recall; Decision Tree and XGBoost predict the training set perfectly.
  • The models with the highest recall on the validation set are Logistic Regression, Gradient Boosting, and AdaBoost.

Models with Undersampled Data¶

In [708]:
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train_t, y_train)
print("Before Undersampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Undersampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Undersampling, the shape of X_train: {}".format(X_train_un.shape))
print("After Undersampling, the shape of y_train: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Yes': 976
Before Undersampling, counts of label 'No': 5099 

After Undersampling, counts of label 'Yes': 976
After Undersampling, counts of label 'No': 976 

After Undersampling, the shape of X_train: (1952, 28)
After Undersampling, the shape of y_train: (1952,) 

In [709]:
models_un = []  # Empty list to store all the models

# Appending models into the list
models_un.append(("logr_un", LogisticRegression(random_state=1)))
models_un.append(("dtree_un", DecisionTreeClassifier(random_state=1)))
models_un.append(("Bagging_un", BaggingClassifier(random_state=1)))
models_un.append(("GBM_un", GradientBoostingClassifier(random_state=1)))
models_un.append(("Adaboost_un", AdaBoostClassifier(random_state=1)))
models_un.append(("Xgboost_un", XGBClassifier(random_state=1, eval_metric="logloss")))

names_un = []  # Empty list to store name of the models
score_train_un = []
score_val_un = []

# loop through all models to get the training performance
for name, model in models_un:
    model.fit(X_train_un, y_train_un)
    scores = model_performance_classification_sklearn(model, X_train_un, y_train_un)
    score_train_un.append(scores.T)
    names_un.append(name)
# create dataframe of results
models_train_un = pd.concat(score_train_un, axis=1)
models_train_un.columns = names_un

# loop through all models to get the validation performance (models are already fitted above)
for name, model in models_un:
    scores = model_performance_classification_sklearn(model, X_val_t, y_val)
    score_val_un.append(scores.T)
# create dataframe of results
models_val_un = pd.concat(score_val_un, axis=1)
models_val_un.columns = names_un

print("Training Performance Comparison:")
models_train_un
Training Performance Comparison:
Out[709]:
logr_un dtree_un Bagging_un GBM_un Adaboost_un Xgboost_un
Accuracy 0.835 1.000 0.990 0.940 0.892 1.000
Recall 0.829 1.000 0.983 0.942 0.894 1.000
Precision 0.838 1.000 0.998 0.939 0.891 1.000
F1 0.834 1.000 0.990 0.940 0.893 1.000
In [710]:
print("\n" "Validation Performance Comparison:" "\n")
models_val_un
Validation Performance Comparison:

Out[710]:
logr_un dtree_un Bagging_un GBM_un Adaboost_un Xgboost_un
Accuracy 0.824 0.817 0.887 0.899 0.881 0.896
Recall 0.850 0.813 0.850 0.877 0.896 0.902
Precision 0.474 0.461 0.606 0.634 0.585 0.622
F1 0.608 0.588 0.708 0.736 0.708 0.736
  • Logistic regression still performs well on recall and is not overfitting.
  • Overall, validation recall has improved compared to the models trained on oversampled data.
  • Overfitting has decreased from performance with oversampled data.
  • Decision Tree, Bagging, and XGBoost are still overfitting greatly to the training set.
  • AdaBoost and Gradient Boosting models are showing high validation performance on recall with little overfitting.

Hyperparameter Tuning With Random Search - Top 3¶

The top 3 models, which performed well on the training and validation sets with the least overfitting, are Logistic Regression, AdaBoost, and Gradient Boosting trained on undersampled data. We will tune these three models.

Logistic Regression Hyperparameter Tuning¶

In [711]:
# define the model
logr_tuned = LogisticRegression(random_state=1)
# Grid of parameters to choose from
from scipy.stats import loguniform

param_grid = {
    "solver": ["newton-cg", "lbfgs", "liblinear"],
    "penalty": ["none", "l1", "l2", "elasticnet"],
    "C": loguniform(1e-5, 100),
}
# Run the random search
randomized_cv_logr = RandomizedSearchCV(
    estimator=logr_tuned,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=100,
    scoring="recall",
    cv=5,
    random_state=1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv_logr.fit(X_train_un, y_train_un)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv_logr.best_params_, randomized_cv_logr.best_score_
    )
)
Best parameters are {'C': 0.003131281159444946, 'penalty': 'l1', 'solver': 'liblinear'} with CV score=0.8924437467294609:
In [712]:
# Set the clf to the best combination of parameters
logr_tuned = randomized_cv_logr.best_estimator_
# Fit the best algorithm to the data.
logr_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
logr_train_perf = model_performance_classification_sklearn(
    logr_tuned, X_train_un, y_train_un
)
print("Training performance:")
logr_train_perf
Training performance:
Out[712]:
Accuracy Recall Precision F1
0 0.759 0.892 0.704 0.787
In [713]:
# Calculating different metrics on validation set
logr_val_perf = model_performance_classification_sklearn(logr_tuned, X_val_t, y_val)
print("Validation performance:")
logr_val_perf
Validation performance:
Out[713]:
Accuracy Recall Precision F1
0 0.680 0.887 0.321 0.471
  • The model shows good recall and generalizes well; however, precision is very low on the validation set.
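One possible remedy for the low precision (not applied in this notebook) is moving the decision threshold above the 0.5 default via `predict_proba`: raising it trades some recall for precision. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Imbalanced synthetic data; the exact numbers printed are illustrative only
X, y = make_classification(n_samples=2000, weights=[0.84], random_state=1)
clf = LogisticRegression(random_state=1, max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Sweep the threshold and record (precision, recall) at each setting
results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y, pred), recall_score(y, pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Whether this trade is worthwhile depends on the relative cost of chasing non-churners versus missing churners, which the evaluation criterion above weighs toward recall.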

Gradient Boosting Hyperparameter Tuning¶

In [714]:
# define the model
gb_tuned = GradientBoostingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    "init": [
        AdaBoostClassifier(random_state=1),
        DecisionTreeClassifier(random_state=1),
    ],
    "n_estimators": np.arange(70, 150, 10),
    "learning_rate": [0.1, 0.01, 0.05, 0.50, 0.20, 1],
    "subsample": [0.3, 0.5, 0.7, 0.8, 1],
    "max_features": [0.3, 0.5, 0.7, 0.8, 1],
}
# Run the random search
randomized_cv_gb = RandomizedSearchCV(
    estimator=gb_tuned,
    param_distributions=parameters,
    n_iter=100,
    scoring="recall",
    cv=5,
    random_state=1,
    n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv_gb.fit(X_train_un, y_train_un)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv_gb.best_params_, randomized_cv_gb.best_score_
    )
)
Best parameters are {'subsample': 1, 'n_estimators': 90, 'max_features': 0.5, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8954683411826269:
In [715]:
# Set the clf to the best combination of parameters
gb_tuned = randomized_cv_gb.best_estimator_
# Fit the best algorithm to the data.
gb_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
gb_train_perf = model_performance_classification_sklearn(
    gb_tuned, X_train_un, y_train_un
)
print("Training performance:")
gb_train_perf
Training performance:
Out[715]:
Accuracy Recall Precision F1
0 0.964 0.964 0.963 0.964
In [716]:
# Calculating different metrics on validation set
gb_val_perf = model_performance_classification_sklearn(gb_tuned, X_val_t, y_val)
print("Validation performance:")
gb_val_perf
Validation performance:
Out[716]:
Accuracy Recall Precision F1
0 0.901 0.877 0.640 0.740
  • Model shows good performance on recall, but more overfitting than logistic regression.

AdaBoost Hyperparameter Tuning¶

In [717]:
# define the model
ada_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
param_grid = {
    "n_estimators": np.arange(10, 150, 10),
    "learning_rate": [0.1, 0.01, 0.05, 0.50, 0.20, 0.90, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
        DecisionTreeClassifier(max_depth=5, random_state=1),
        DecisionTreeClassifier(max_depth=10, random_state=1),
        DecisionTreeClassifier(max_depth=30, random_state=1),
    ],
}
# Run the random search
randomized_cv_ada = RandomizedSearchCV(
    estimator=ada_tuned,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=100,
    scoring="recall",
    cv=5,
    random_state=1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv_ada.fit(X_train_un, y_train_un)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv_ada.best_params_, randomized_cv_ada.best_score_
    )
)
Best parameters are {'n_estimators': 140, 'learning_rate': 0.9, 'base_estimator': DecisionTreeClassifier(max_depth=10, random_state=1)} with CV score=0.8975248560962846:
In [718]:
# Set the clf to the best combination of parameters
ada_tuned = randomized_cv_ada.best_estimator_
# Fit the best algorithm to the data.
ada_tuned.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
ada_train_perf = model_performance_classification_sklearn(
    ada_tuned, X_train_un, y_train_un
)
print("Training performance:")
ada_train_perf
Training performance:
Out[718]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [719]:
# Calculating different metrics on validation set
ada_val_perf = model_performance_classification_sklearn(ada_tuned, X_val_t, y_val)
print("Validation performance:")
ada_val_perf
Validation performance:
Out[719]:
Accuracy Recall Precision F1
0 0.891 0.902 0.610 0.728
  • The model overfits the training data (perfect training scores) more than the previous models.

Part 5: Best Model and Test Results¶

In [720]:
# training performance comparison

models_train_comp_df = pd.concat(
    [logr_train_perf.T, gb_train_perf.T, ada_train_perf.T,], axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression Tuned",
    "Gradient Boosting Tuned",
    "AdaBoost Tuned",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[720]:
Logistic Regression Tuned Gradient Boosting Tuned AdaBoost Tuned
Accuracy 0.759 0.964 1.000
Recall 0.892 0.964 1.000
Precision 0.704 0.963 1.000
F1 0.787 0.964 1.000
In [721]:
# validation performance comparison

models_val_comp_df = pd.concat(
    [logr_val_perf.T, gb_val_perf.T, ada_val_perf.T,], axis=1,
)
models_val_comp_df.columns = [
    "Logistic Regression Tuned",
    "Gradient Boosting Tuned",
    "AdaBoost Tuned",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[721]:
Logistic Regression Tuned Gradient Boosting Tuned AdaBoost Tuned
Accuracy 0.680 0.901 0.891
Recall 0.887 0.877 0.902
Precision 0.321 0.640 0.610
F1 0.471 0.740 0.728
  • Performance of all three models improved with hyperparameter tuning.
  • All models had similar recall scores on validation, but Logistic Regression had much lower accuracy and F1 scores.
  • Gradient Boosting and AdaBoost had similar validation performance, but AdaBoost overfits the training set more.
  • We will choose Gradient Boosting as the best model and evaluate it on the test set.
In [767]:
# Calculating different metrics on the test set
gb_test_perf = model_performance_classification_sklearn(gb_tuned, X_test_t, y_test)
print("Test performance:")
gb_test_perf
Test performance:
Out[767]:
Accuracy Recall Precision F1
0 0.906 0.911 0.648 0.757
In [768]:
confusion_matrix_sklearn(gb_tuned, X_test_t, y_test)
  • Test performance is similar to validation performance, indicating the model generalizes well.
  • Recall on the test set is above 90%.
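For reference, the recall reported above can be read straight off the confusion matrix as TP / (TP + FN). A tiny self-contained check with made-up labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

# sklearn lays the binary confusion matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_recall = tp / (tp + fn)  # 3 / (3 + 1) = 0.75
print(manual_recall, recall_score(y_true, y_pred))
```

This is why recall was the right metric for the problem: every false negative in the matrix is a churner the bank never hears about until the account closes.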
In [769]:
# Feature importances
# NOTE: this assumes pd.get_dummies produces the same column order as the
# fitted ColumnTransformer; preprocessor.get_feature_names_out() is a safer source
feature_names = pd.get_dummies(X_train, drop_first=True).columns
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • The most important features to the model are the total transaction count and the total revolving balance.
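Impurity-based importances from tree ensembles can be biased toward high-cardinality features; permutation importance on held-out data is a common cross-check. A sketch on synthetic data (not the notebook's fitted model), scoring the drop in recall when each feature is shuffled:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=800, n_features=8, n_informative=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

gb = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in held-out recall
result = permutation_importance(
    gb, X_te, y_te, scoring="recall", n_repeats=5, random_state=1
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

If the permutation ranking agrees with `feature_importances_` (as it likely would for dominant features such as transaction count), that strengthens the case for the insights above.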

Part 6: Pipeline Model¶

A preprocessor was defined in earlier steps. We will use this in our final pipeline.

In [770]:
# Now that the best model is chosen, a separate validation set is no longer needed
# Splitting the data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(8101, 17) (2026, 17)
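The `stratify=y` argument matters here: it keeps the ~16% churn rate (nearly) identical in both splits, so the test set is representative of the class balance. A quick sketch on synthetic data standing in for the churn set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~16% positive class, mimicking the churn rate
X, y = make_classification(n_samples=5000, weights=[0.84, 0.16], random_state=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)

# The positive-class rate is preserved in both splits
print(f"overall: {y.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```

Without stratification, a small or heavily imbalanced dataset can land a noticeably different churn rate in the test set by chance.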
In [771]:
# Creating new pipeline with best parameters
# train model on undersampled data
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
model = Pipeline(
    steps=[
        ("pre", preprocessor),
        (
            "GBM",
            GradientBoostingClassifier(
                random_state=1,
                subsample=1,
                n_estimators=90,
                max_features=0.5,
                learning_rate=0.2,
                init=AdaBoostClassifier(random_state=1),
            ),
        ),
    ]
)
# Fit the model on training data
model.fit(X_train_un, y_train_un)
Out[771]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer_num',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standard '
                                                                   'scaler',
                                                                   StandardScaler())]),
                                                  ['Customer_Age',
                                                   'Dependent_count',
                                                   'Months_on_book',
                                                   'Total_Relationship_Count',
                                                   'Months_Inactive_12_mon',
                                                   'Contacts_Count_12_mon',
                                                   'Total_Revolving_Bal',
                                                   'Total_Tra...
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Gender', 'Education_Level',
                                                   'Marital_Status',
                                                   'Income_Category',
                                                   'Card_Category'])])),
                ('GBM',
                 GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                            learning_rate=0.2, max_features=0.5,
                                            n_estimators=90, random_state=1,
                                            subsample=1))])
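One advantage of wrapping preprocessing and model in a single pipeline is that it can be persisted as one artifact, so imputation, scaling, encoding, and the classifier always travel together. A minimal sketch using joblib (a scikit-learn dependency), with a toy pipeline standing in for the notebook's; the filename is illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=1)

pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
).fit(X, y)

# Persist and reload the whole fitted pipeline as one file
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "churn_pipeline.joblib")
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline makes identical predictions
assert np.array_equal(pipe.predict(X), reloaded.predict(X))
print("round-trip OK")
```

In production, the bank's scoring job would only need `joblib.load(...)` plus raw customer rows; no separate preprocessing code to keep in sync.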
In [772]:
# Creating new pipeline with best parameters, including RandomUnderSampler in the pipeline
# imblearn's Pipeline (unlike sklearn's) supports resampling steps, applied only during fit
from imblearn.pipeline import Pipeline as Pipeline_imb

model2 = Pipeline_imb(
    steps=[
        ("pre", preprocessor),
        ("rus", RandomUnderSampler(random_state=1)),
        (
            "GBM",
            GradientBoostingClassifier(
                random_state=1,
                subsample=1,
                n_estimators=90,
                max_features=0.5,
                learning_rate=0.2,
                init=AdaBoostClassifier(random_state=1),
            ),
        ),
    ]
)
# Fit the model on training data
model2.fit(X_train, y_train)
Out[772]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer_num',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standard '
                                                                   'scaler',
                                                                   StandardScaler())]),
                                                  ['Customer_Age',
                                                   'Dependent_count',
                                                   'Months_on_book',
                                                   'Total_Relationship_Count',
                                                   'Months_Inactive_12_mon',
                                                   'Contacts_Count_12_mon',
                                                   'Total_Revolving_Bal',
                                                   'Total_Tra...
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Gender', 'Education_Level',
                                                   'Marital_Status',
                                                   'Income_Category',
                                                   'Card_Category'])])),
                ('rus', RandomUnderSampler(random_state=1)),
                ('GBM',
                 GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                            learning_rate=0.2, max_features=0.5,
                                            n_estimators=90, random_state=1,
                                            subsample=1))])
In [773]:
# Calculating different metrics on test set
gb_test_perf = model_performance_classification_sklearn(model2, X_test, y_test)
print("Test performance:")
gb_test_perf
Test performance:
Out[773]:
Accuracy Recall Precision F1
0 0.914 0.935 0.665 0.777
In [774]:
confusion_matrix_sklearn(model2, X_test, y_test)
  • Test performance is similar to the earlier validation performance, confirming the pipeline generalizes well with a high recall score.
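Putting RandomUnderSampler inside the imblearn pipeline matters because the resampling is applied only when fitting, never when predicting, so test rows are scored untouched. The undersampling step itself is simple; a pure-NumPy sketch of the idea (a hypothetical helper, not imblearn's implementation):

```python
import numpy as np

def undersample(X, y, random_state=1):
    """Randomly drop majority-class rows until all classes are equally sized."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve original row order
    return X[keep], y[keep]

# 840 majority / 160 minority rows -> 160 / 160 after undersampling
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 840 + [1] * 160)
X_un, y_un = undersample(X, y)
print(np.bincount(y_un))  # [160 160]
```

The cost of undersampling is discarded majority-class data, which is why it is only ever applied to the training fold and never to validation or test data.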

Part 7: Conclusions and Recommendations¶

Summary/Insights¶

Data Background:

  • There are 10127 rows and 21 columns in the data.
  • There are no duplicated rows.
  • All columns were object, float or int type.
  • 2 columns had missing values.

Data Preprocessing:

  • The CLIENTNUM column was dropped as it isn’t important to the analysis.
  • Replaced "abc" with "Unknown" in Income_Category.
  • Categorical variables were converted to category data type.
  • Outliers were identified in several of the highly skewed variables. To address this, highly skewed variables were transformed using log or sqrt functions based on which function made the distribution more normally distributed.
  • "Avg_Open_To_Buy_sqrt", "Total_Trans_Amt_log" were dropped from the model due to being highly correlated to other features.
  • Attrition_Flag was coded as 0 and 1.
  • The data was split into training, validation, and test sets, with 20% of the data held out for testing.
  • Imputed missing values in categorical features based on the most frequent value in the training data.
  • Performed one hot encoding on categorical variables.
  • Standardized all numerical features.

Observations from EDA:

  • About 16% of customers have churned their credit cards.
  • Most customers are Female, Graduate level, Married, make less than 40k, and have the Blue card.
  • The median number of products held by attrited customers is lower than that of existing customers. This could be an indication of diminishing interest in the bank.
  • Existing customers have a higher median total revolving balance than attrited customers.
  • Existing customers have much higher median total transaction amounts and transaction counts than attrited customers. They also report higher positive change in usage between Q1 and Q4.
  • Existing customers have a higher median average utilization ratio than attrited customers.
  • Platinum card users are more likely to attrite than other card holder types.

Model Building and Performance:

Models were built to predict whether a customer will renounce credit card services. Recall was chosen as the evaluation metric to minimize false negatives (churners the model misses). Decision tree, logistic regression, bagging, AdaBoost, Gradient Boosting, and XGBoost classifiers were trained with default parameters on the original, oversampled, and undersampled data. The top three models (logistic regression, Gradient Boosting, and AdaBoost) were then tuned via randomized search over their hyperparameters. Gradient Boosting had the best performance with the least overfitting, reaching a recall above 0.9 on the test set. In the chosen Gradient Boosting model, the total transaction count and total revolving balance were the most important variables for predicting churn. Finally, a pipeline was created that preprocesses the categorical and numerical features, undersamples the training data, fits the model, and makes predictions.

Recommendations¶

  • A machine learning model was successfully built that the bank can use to predict which customers are likely to churn credit card services.
  • The bank should watch for customers that have low total revolving balances and report low transaction amounts and transaction counts. These customers are more likely to churn services.
  • Attrited customers are more likely to hold fewer products from the bank. It is recommended that the bank advertise other credit card types to the customer so they find one that fits their preferences and spending habits better.
  • It is recommended that the bank track transaction counts and amounts and utilization rates between each quarter of the year to be able to detect decreases and reach out to those customers early.
  • It is recommended that the business gather more data on other features like use of online banking services, if they have other accounts with the bank, or have credit cards at other banks.
  • If customers have cards at other banks, it is recommended for the bank to research how those are different and how the bank is able to compete with other credit cards on the market, and relay that to the customers.
  • The bank should look into not only the number but the type of contacts being made between the bank and customer, and whether they have been positive interactions.
  • It is recommended to rebuild the model using grid search, alternative feature transformations, and different sets of dropped variables to see whether performance can be improved further.
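The grid-search follow-up suggested above could start from a narrow grid centered on the random-search winner. A minimal sketch on synthetic data (parameter values are illustrative, chosen around the tuned model's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data standing in for the churn set
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=1)

# Small exhaustive grid centered on the random-search result
grid = {
    "n_estimators": [80, 90, 100],
    "learning_rate": [0.1, 0.2],
}
gs = GridSearchCV(
    GradientBoostingClassifier(random_state=1, max_features=0.5),
    param_grid=grid,
    scoring="recall",  # same metric used throughout the notebook
    cv=3,
    n_jobs=-1,
)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))
```

Unlike randomized search, this evaluates every combination, so keeping the grid small around the known-good region keeps the runtime manageable.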